This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data.dataset in kaggle
Exploring the dataset
knitr::kable(head(df_fitness))
| 1503960366 |
4/12/2016 |
13162 |
8.50 |
8.50 |
0 |
1.88 |
0.55 |
6.06 |
0 |
25 |
13 |
328 |
728 |
1985 |
| 1503960366 |
4/13/2016 |
10735 |
6.97 |
6.97 |
0 |
1.57 |
0.69 |
4.71 |
0 |
21 |
19 |
217 |
776 |
1797 |
| 1503960366 |
4/14/2016 |
10460 |
6.74 |
6.74 |
0 |
2.44 |
0.40 |
3.91 |
0 |
30 |
11 |
181 |
1218 |
1776 |
| 1503960366 |
4/15/2016 |
9762 |
6.28 |
6.28 |
0 |
2.14 |
1.26 |
2.83 |
0 |
29 |
34 |
209 |
726 |
1745 |
| 1503960366 |
4/16/2016 |
12669 |
8.16 |
8.16 |
0 |
2.71 |
0.41 |
5.04 |
0 |
36 |
10 |
221 |
773 |
1863 |
| 1503960366 |
4/17/2016 |
9705 |
6.48 |
6.48 |
0 |
3.19 |
0.78 |
2.51 |
0 |
38 |
20 |
164 |
539 |
1728 |
str(df_fitness)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
knitr::kable(summary(df_fitness))
|
Min. :1.504e+09 |
Length:940 |
Min. : 0 |
Min. : 0.000 |
Min. : 0.000 |
Min. :0.0000 |
Min. : 0.000 |
Min. :0.0000 |
Min. : 0.000 |
Min. :0.000000 |
Min. : 0.00 |
Min. : 0.00 |
Min. : 0.0 |
Min. : 0.0 |
Min. : 0 |
|
1st Qu.:2.320e+09 |
Class :character |
1st Qu.: 3790 |
1st Qu.: 2.620 |
1st Qu.: 2.620 |
1st Qu.:0.0000 |
1st Qu.: 0.000 |
1st Qu.:0.0000 |
1st Qu.: 1.945 |
1st Qu.:0.000000 |
1st Qu.: 0.00 |
1st Qu.: 0.00 |
1st Qu.:127.0 |
1st Qu.: 729.8 |
1st Qu.:1828 |
|
Median :4.445e+09 |
Mode :character |
Median : 7406 |
Median : 5.245 |
Median : 5.245 |
Median :0.0000 |
Median : 0.210 |
Median :0.2400 |
Median : 3.365 |
Median :0.000000 |
Median : 4.00 |
Median : 6.00 |
Median :199.0 |
Median :1057.5 |
Median :2134 |
|
Mean :4.855e+09 |
NA |
Mean : 7638 |
Mean : 5.490 |
Mean : 5.475 |
Mean :0.1082 |
Mean : 1.503 |
Mean :0.5675 |
Mean : 3.341 |
Mean :0.001606 |
Mean : 21.16 |
Mean : 13.56 |
Mean :192.8 |
Mean : 991.2 |
Mean :2304 |
|
3rd Qu.:6.962e+09 |
NA |
3rd Qu.:10727 |
3rd Qu.: 7.713 |
3rd Qu.: 7.710 |
3rd Qu.:0.0000 |
3rd Qu.: 2.053 |
3rd Qu.:0.8000 |
3rd Qu.: 4.782 |
3rd Qu.:0.000000 |
3rd Qu.: 32.00 |
3rd Qu.: 19.00 |
3rd Qu.:264.0 |
3rd Qu.:1229.5 |
3rd Qu.:2793 |
|
Max. :8.878e+09 |
NA |
Max. :36019 |
Max. :28.030 |
Max. :28.030 |
Max. :4.9421 |
Max. :21.920 |
Max. :6.4800 |
Max. :10.710 |
Max. :0.110000 |
Max. :210.00 |
Max. :143.00 |
Max. :518.0 |
Max. :1440.0 |
Max. :4900 |
- Checking for duplicate and null values
noquote(paste(sum(duplicated(df_fitness)), "duplicate, GREAT"))
## [1] 0 duplicate, GREAT
noquote(paste(sum(is.null(df_fitness)), "Null value, SUPER"))
## [1] 0 Null value, SUPER
- How many users we have in the dataset?
noquote(paste(length(unique(df_fitness$Id)), "Users"))
## [1] 33 Users
- How many non-zero value in Logged Activities Distance column?
non_zero <- sum(df_fitness$LoggedActivitiesDistance != 0)
total <- length(df_fitness$Id)
percentage <- (non_zero/total)*100
noquote(paste(round(percentage,2), "%"))
## [1] 3.4 %
Since it only 3% of the column is available and we have other informative features than “LoggedActivitiesDistance” I will ignore this column.
Q1:How much active the users using fitbit tracker?
- create a new dataframe having ID and Total_steps_mean and activity level
User_activity_level <- df_fitness %>%
group_by(Id) %>%
summarise(Total_steps_mean=mean(TotalSteps))
knitr::kable(head(User_activity_level))
| 1503960366 |
12116.742 |
| 1624580081 |
5743.903 |
| 1644430081 |
7282.967 |
| 1844505072 |
2580.065 |
| 1927972279 |
916.129 |
| 2022484408 |
11370.645 |
- Categorize users into 5 activity levels based on total steps
- Sedentary
- Low active
- Somewhat active
- Active
- Highly active
User_activity_level <- User_activity_level %>%
mutate(activity_level = case_when(
Total_steps_mean < 2000 ~ "sedentary",
Total_steps_mean >= 2000 & Total_steps_mean < 7500 ~ "low active",
Total_steps_mean >= 7500 & Total_steps_mean < 10000 ~ "somewhat active",
Total_steps_mean >= 10000 & Total_steps_mean < 12500 ~ "active",
Total_steps_mean >= 12500 ~ "highly active",
))
- Visualize the different user activity category
Majority of users are low active type, that lead us to the next question what are the factors of low activity level?
Q2: what could affect activity levels? is the weekday, month have a influnce on users?
let’s see the if the day have an influnce on users activity?
From the graph we can see that weekend have drop in total steps and sum activity, which means that weekdays have an affect on the users activity level.
other question could be raised here is there a prefered hours to exersice?
- By useing Hourly Intensities dataset we can answer this:
df_Intensities <- read.csv(file = '~/Documents/My_website/content/post/2022-02-09-fitbit-fitness-tracker-analysis/data/hourlyIntensities_merged.csv')
knitr::kable(head(df_Intensities))
| 1503960366 |
4/12/2016 12:00:00 AM |
20 |
0.333333 |
| 1503960366 |
4/12/2016 1:00:00 AM |
8 |
0.133333 |
| 1503960366 |
4/12/2016 2:00:00 AM |
7 |
0.116667 |
| 1503960366 |
4/12/2016 3:00:00 AM |
0 |
0.000000 |
| 1503960366 |
4/12/2016 4:00:00 AM |
0 |
0.000000 |
| 1503960366 |
4/12/2016 5:00:00 AM |
0 |
0.000000 |
df_Intensities$ActivityHour <- strptime(df_Intensities$ActivityHour,format="%m/%d/%Y %H:%M:%S")
df_Intensities$Time <- format(as.POSIXct(df_Intensities$ActivityHour,format="%Y:%m:%d %H:%M:%S"),"%H:%M:%S")
df_Intensities$Date <- as.Date(df_Intensities$ActivityHour, format = "%m:%d:%Y")
str(df_Intensities)
## 'data.frame': 22099 obs. of 6 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour : POSIXlt, format: "2016-04-12 12:00:00" "2016-04-12 01:00:00" ...
## $ TotalIntensity : int 20 8 7 0 0 0 0 0 13 30 ...
## $ AverageIntensity: num 0.333 0.133 0.117 0 0 ...
## $ Time : chr "12:00:00" "01:00:00" "02:00:00" "03:00:00" ...
## $ Date : Date, format: "2016-04-12" "2016-04-12" ...
From the plot it show that users tend to be more active in the hours range 5-7 much more early moring and in the evening
Q3: Is there any improvement in the users weight or average caleroies per day?
- By using weight lig info dataset
df_weight <- read.csv(file = '~/Documents/My_website/content/post/2022-02-09-fitbit-fitness-tracker-analysis/data/weightLogInfo_merged.csv')
# I will subset the dataset because there is redundent column like weight in pound
df_weight = subset(df_weight, select = -c(WeightPounds, Fat, LogId))
knitr::kable(head(df_weight))
| 1503960366 |
5/2/2016 11:59:59 PM |
52.6 |
22.65 |
True |
| 1503960366 |
5/3/2016 11:59:59 PM |
52.6 |
22.65 |
True |
| 1927972279 |
4/13/2016 1:08:52 AM |
133.5 |
47.54 |
False |
| 2873212765 |
4/21/2016 11:59:59 PM |
56.7 |
21.45 |
True |
| 2873212765 |
5/12/2016 11:59:59 PM |
57.3 |
21.69 |
True |
| 4319703577 |
4/17/2016 11:59:59 PM |
72.4 |
27.45 |
True |
- Formatting the date column and creating month column
df_weight <- df_weight %>%
mutate(Date=as.Date(Date, format = "%m/%d/%Y"))
df_weight <- df_weight %>%
mutate(month = format(df_weight$Date, "%m"))
str(df_weight)
## 'data.frame': 67 obs. of 6 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : Date, format: "2016-05-02" "2016-05-03" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ month : chr "05" "05" "04" "04" ...
- Create a new dataset that hold average weight for each user during the two month April and May
weight_monthly_data <- df_weight %>%
group_by(month, Id)%>%
summarise(Weight_mean= mean(WeightKg), BMI_mean=mean(BMI))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
knitr::kable(head(weight_monthly_data))
| 04 |
1927972279 |
133.50000 |
47.54000 |
| 04 |
2873212765 |
56.70000 |
21.45000 |
| 04 |
4319703577 |
72.40000 |
27.45000 |
| 04 |
4558609924 |
70.00000 |
27.35500 |
| 04 |
5577150313 |
90.70000 |
28.00000 |
| 04 |
6962181067 |
61.54445 |
24.02389 |
- Visualizing users average weight and BMI during April and May, some users only have a record for the their weight and BMI in one month only

From the plot we can see that there is no big improvement in the users weight neither their BMI.
- Let’s see users if there any increase or decrease in calories between two month
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
From the plot we can see that April have higher measure than May it could be due to the weather.